You (the Wholesale Customer dataset from the UCI Machine Learning Repository¹) are a wholesale distributor with 440 customers in 3 regions: “Oporto” with 47 customers, “Lisbon” with 77 customers and “Other” with 316 customers. Among them, 298 belong to the “Cafe” distribution channel (including hotels and restaurants) and 142 to “Retail”.
library(cluster)
library(dplyr)
library(GGally)
library(ggplot2)
library(knitr)
library(pander)
library(MASS)
library(reshape2)
library(ROCR)
sale = read.csv("WholesaleCustomer.csv", header=TRUE)
colnames(sale)[7] = "Detergents" # rename "Detergents_Paper" for brevity
sale = within(sale, {Channel[Channel == 1] <- "Cafe"
Channel[Channel == 2] <- "Retail"
Region[Region == 1] <- "Lisbon"
Region[Region == 2] <- "Oporto"
Region[Region == 3] <- "Other" } )
sale[,c("Channel", "Region")] = lapply(sale[,c("Channel", "Region")], as.factor)
Here are a few rows from your annual revenue dataset:
kable(head(sale), align="c")
| Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents | Delicassen |
|---|---|---|---|---|---|---|---|
| Retail | Other | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| Retail | Other | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |
| Retail | Other | 6353 | 8808 | 7684 | 2405 | 3516 | 7844 |
| Cafe | Other | 13265 | 1196 | 4221 | 6404 | 507 | 1788 |
| Retail | Other | 22615 | 5410 | 7198 | 3915 | 1777 | 5185 |
| Retail | Other | 9413 | 8259 | 5126 | 666 | 1795 | 1451 |
Here is the number of customers in each revenue range, for each product and for each channel (“Cafe” or “Retail”):
The first important thing to notice is that the histograms look reasonably bell-shaped at the log-scale of revenues. Thus we transform our data from revenues to log-revenues.
Do log-numbers look unusual to you? You may ask: are there any other advantages of using them? Yes: log-revenues automatically capture the “percentage” difference between revenues (120 is 20% larger than 100, and 1200 is also 20% larger than 1000) rather than the absolute difference. You can also think of a log-value as an “order of magnitude” or a “count of digits in a number”. A log-revenue of 2 means a revenue of 100, a log-revenue of 3 means a revenue of 1000, a log-revenue of 4 means a revenue of 10000, and a log-revenue of 2.5 means a revenue of about 316, the geometric midpoint between 100 and 1000.
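A two-line check of this “percentage” property (a sketch; any pair of values differing by the same percentage behaves the same way):

```r
# Equal percentage differences give equal log-differences:
log10(120) - log10(100)    # 20% more than 100
log10(1200) - log10(1000)  # 20% more than 1000 -- same log-difference, log10(1.2)
```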
products = colnames(sale[,3:8]); #products
sale$Total = rowSums(sale[,products]) # total revenue
l1sale = sale; l1sale[,3:9] = log10(1.+(sale[,3:9])) # df with logs
From the histograms above it looks like the mean revenues of Milk, Grocery and Detergents (but not necessarily of Fresh, Frozen and Delicassen) are significantly different for Cafe and Retail customers. There are parametric and non-parametric ways to test this hypothesis, the key word being ANOVA (Analysis of variance). In such an analysis the groups are considered as given and the revenues as dependent variables.
Strictly speaking, ANOVA makes assumptions of normality (which should rather be called “gaussianity”), homoscedasticity, and independence of observations. For the dataset under consideration there is no doubt about independence. The other two properties are not necessarily satisfied. Practically speaking, there is a growing discussion suggesting that many statistical methods are robust to deviations from gaussianity, and a reasonably bell-shaped histogram (as in our case) suffices for the application of those methods. There are even opinions that normality tests should never be applied at all.
One can first resort to the non-parametric Kruskal-Wallis test. Small p-values (smaller than a chosen threshold, typically 0.05) show that we can reject the hypothesis that the revenue distributions are the same for Retail and Cafe customers. This holds for each product separately as well as for the total revenue.
apply(l1sale[,3:9], 2, function(x) kruskal.test(x ~ Channel, data=l1sale)$p.value)
## Fresh Milk Grocery Frozen Detergents Delicassen Total
## 1.83e-04 7.67e-38 6.60e-50 7.39e-07 4.56e-55 5.49e-04 1.72e-24
The two-sample t-test assumes gaussian distributions but, in the Welch variant used here (var.equal=FALSE), not homoscedasticity. It leads to similar results.
apply(l1sale[,3:9], 2, function(x) t.test(x ~ Channel, data=l1sale, var.equal=FALSE)$p.value)
## Fresh Milk Grocery Frozen Detergents Delicassen Total
## 5.76e-04 3.26e-46 1.84e-68 1.39e-07 2.26e-88 5.14e-03 4.70e-28
If we consider classical ANOVA applicable, the results are again similar.
apply(l1sale[,3:9], 2, function(x) anova(lm(x ~ Channel, data=l1sale))[["Pr(>F)"]][1])
## Fresh Milk Grocery Frozen Detergents Delicassen Total
## 3.56e-04 1.84e-40 5.34e-57 1.94e-07 1.51e-73 5.84e-03 1.49e-22
Notice, however, that the p-values for Fresh, Frozen and Delicassen are more than 25 orders of magnitude higher than those for Milk, Grocery and Detergents. This matches our intuition from the histograms.
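Since seven tests are run at once, a cautious reader may want to adjust for multiple comparisons. Here is a sketch with Holm's correction, reusing `l1sale` from above; the raw p-values are so small that the conclusions do not change:

```r
# Holm-adjusted p-values for the seven Kruskal-Wallis tests above
pvals = apply(l1sale[,3:9], 2,
              function(x) kruskal.test(x ~ Channel, data=l1sale)$p.value)
signif(p.adjust(pvals, method="holm"), 3)
```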
The differences between the groups can be illustrated with boxplots (keep in mind, though, that the central line in a boxplot shows the median, not the mean).
For the different regions, the p-values are large. Thus we cannot reject the hypothesis that the mean revenues are the same.
apply(l1sale[,3:9], 2, function(x) kruskal.test(x ~ Region, data=l1sale)$p.value)
## Fresh Milk Grocery Frozen Detergents Delicassen Total
## 0.586 0.482 0.202 0.239 0.797 0.890 0.778
Similar to the two-sample t-test, the three groups can also be tested for equal means with Welch's one-way test.
apply(l1sale[,3:9], 2, function(x) oneway.test(x ~ Region, data=l1sale, var.equal=FALSE)$p.value)
## Fresh Milk Grocery Frozen Detergents Delicassen Total
## 0.713 0.609 0.117 0.151 0.708 0.947 0.903
Results from classical ANOVA are also similar.
apply(l1sale[,3:9], 2, function(x) anova(lm(x ~ Region, data=l1sale))[["Pr(>F)"]][1])
## Fresh Milk Grocery Frozen Detergents Delicassen Total
## 0.658 0.623 0.211 0.193 0.704 0.957 0.917
That the difference between the groups is small can be illustrated with box- and violin-plots.
Here is the revenue comparison for different channels and different regions
From the histograms above it looks like the group of Retail customers differs from the group of Cafe customers in the revenues of Milk, Grocery and Detergents, while not necessarily in Fresh, Frozen and Delicassen. Is it indeed so? To formulate this question more precisely: which combinations of revenues allow one to maximally separate the groups? In such an analysis the revenues are considered as given and the groups as dependent variables.
The most popular method is LDA (linear discriminant analysis). Omitting a discussion of its assumptions (multivariate gaussianity and identical covariance matrices for each group), I proceed with finding the linear combination that best separates the classes.
lda_ch = lda(Channel ~ ., data=l1sale[,c(1,3:8)], CV=FALSE)
lda_ch$scaling
## LD1
## Fresh -0.0966
## Milk 0.4631
## Grocery 0.6784
## Frozen -0.3453
## Detergents 1.3510
## Delicassen 0.0145
The largest contributions to it come from Milk, Grocery and Detergents. Keeping in mind our histograms, this does not come as a surprise. Here is the density plot versus this linear discriminant “LD1”.
plot(lda_ch)
The mean values of the log-revenues for each group are
lda_ch$means
## Fresh Milk Grocery Frozen Detergents Delicassen
## Cafe 3.87 3.34 3.45 3.27 2.58 2.85
## Retail 3.64 3.92 4.13 2.97 3.73 3.00
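Since these are means of log10(revenue + 1), a quick sketch converts them back to the revenue scale (giving, roughly, the geometric-mean revenue per group), which is easier to read:

```r
# Undo the log-transform: 10^mean - 1 reverses log10(1 + revenue)
round(10^lda_ch$means - 1)
```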
The quality of separation can be estimated by the ratio of the between- and within-group standard deviations
lda_ch$svd
## [1] 23.7
To estimate the quality of the classification produced by LDA we should not use the training set, but resort to cross-validation. Since leave-one-out cross-validation is built in, let us use it. Here are the confusion matrix and the total percentage of correctly classified customers.
lda_ch_cv = lda(Channel ~ ., data=l1sale[,c(1,3:8)], CV=TRUE)
ct_lda <- table(l1sale$Channel, lda_ch_cv$class); ct_lda # confusion matrix
##
## Cafe Retail
## Cafe 273 25
## Retail 13 129
#diag(prop.table(ct_lda, 1)) # percent correct for each Channel
sum(diag(prop.table(ct_lda))) # total percent correct
## [1] 0.914
This method also provides the posterior probabilities of belonging to each group. Thus, we can assess the quality of the model with more refined measures. Below, after discussing QDA, I plot the ROC curve and calculate the AUC.
QDA (quadratic discriminant analysis) differs from LDA in that it does not assume a common covariance matrix for all classes. This leads to more free parameters and thus to greater flexibility. Since QDA uses quadratic functions, there are no easily interpretable linear discriminants. We can still use the method to make predictions for unseen samples (here via leave-one-out cross-validation). This results in the following confusion matrix and total percentage of correctly classified customers. The results appear to be slightly worse than those for LDA.
qda_ch_cv = qda(Channel ~ ., data=l1sale[,c(1,3:8)], CV=TRUE, method="mle")
ct_qda <- table(l1sale$Channel, qda_ch_cv$class); ct_qda # confusion matrix
##
## Cafe Retail
## Cafe 269 29
## Retail 13 129
#diag(prop.table(ct_qda, 1)) # percent correct for each Channel
sum(diag(prop.table(ct_qda))) # total percent correct
## [1] 0.905
roc_auc = function(da){
pred = prediction(predictions=da$posterior[,"Retail"],
labels=as.numeric(l1sale$Channel=="Retail"))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
roc.auc = performance(pred, measure="auc");
return(list(perf=perf, auc=roc.auc@y.values[[1]]))
}
lda_roc_auc = roc_auc(lda_ch_cv)
qda_roc_auc = roc_auc(qda_ch_cv)
cutoffs = c(0.2,0.4, 0.5, 0.6, 0.8)
plot(lda_roc_auc$perf, colorize=TRUE, lwd=3,
main="ROC curve for LDA and QDA prediction of Retail channel",
print.cutoffs.at=cutoffs, text.adj=c(1.0,-0.5))
plot(qda_roc_auc$perf, colorize=TRUE, lwd=3, add=TRUE) # overlay QDA curve
text(-0.03, 0.99, "LDA", pos=4)
text(0.1, 0.85, "QDA", pos=4)
text(0.15,0.6,paste0("LDA AUC = ", round(lda_roc_auc$auc,3)), pos=4)
text(0.2,0.5,paste("cutoff positions at", paste(cutoffs, collapse=", ")), pos=4)
text(0.15,0.3,paste0("QDA AUC = ", round(qda_roc_auc$auc,3)), pos=4)
The results of discriminant analysis should typically be compared with those of other classification methods, such as logistic regression, support vector machines, and neural nets.
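As a quick baseline of that kind, here is a logistic-regression sketch on the same log-revenues (training-set accuracy only, so it is somewhat optimistic compared to the cross-validated LDA/QDA numbers above):

```r
# Logistic regression as a baseline classifier for the Channel
glm_ch = glm(Channel ~ ., data=l1sale[,c(1,3:8)], family=binomial)
glm_pred = ifelse(predict(glm_ch, type="response") > 0.5, "Retail", "Cafe")
table(l1sale$Channel, glm_pred, dnn=c("Channel", "glm_pred")) # confusion matrix
mean(glm_pred == l1sale$Channel)                              # training accuracy
```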
Revenues for some products are correlated with each other
l1sale.cor = cor(l1sale[,products]); l1sale.cor
## Fresh Milk Grocery Frozen Detergents Delicassen
## Fresh 1.0000 -0.0211 -0.133 0.3863 -0.159 0.256
## Milk -0.0211 1.0000 0.761 -0.0552 0.679 0.342
## Grocery -0.1330 0.7611 1.000 -0.1645 0.797 0.240
## Frozen 0.3863 -0.0552 -0.165 1.0000 -0.213 0.256
## Detergents -0.1587 0.6787 0.797 -0.2128 1.000 0.168
## Delicassen 0.2564 0.3423 0.240 0.2563 0.168 1.000
Let us try to determine the (independent) latent variables responsible for the variety of the observed revenues. Suitable here is exploratory factor analysis.
faa2 = factanal(l1sale[,products], factors=2, method="mle", rotation="varimax", scores="regression")
print(faa2, digits = 2, cutoff = .2, sort = TRUE)
##
## Call:
## factanal(x = l1sale[, products], factors = 2, scores = "regression", rotation = "varimax", method = "mle")
##
## Uniquenesses:
## Fresh Milk Grocery Frozen Detergents Delicassen
## 0.65 0.30 0.13 0.61 0.26 0.65
##
## Loadings:
## Factor1 Factor2
## Milk 0.83
## Grocery 0.93
## Detergents 0.84
## Fresh 0.58
## Frozen 0.61
## Delicassen 0.31 0.51
##
## Factor1 Factor2
## SS loadings 2.38 1.01
## Proportion Var 0.40 0.17
## Cumulative Var 0.40 0.57
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 4.41 on 4 degrees of freedom.
## The p-value is 0.353
The chi-square statistic and p-value in “factanal” are the results of testing the hypothesis that the model fits the correlation matrix perfectly. When the p-value is large (larger than the chosen significance level, typically 0.05), as it is here, we cannot reject this hypothesis. Thus, 2 factors are enough to explain the empirical correlation matrix.
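For comparison, here is a sketch of the same sufficiency test with one and two factors, reusing `l1sale` and `products` from above; a small p-value for k = 1 would mean a single factor is not enough:

```r
# p-values of the "k factors are sufficient" test for k = 1, 2
sapply(1:2, function(k)
  factanal(l1sale[,products], factors=k, method="mle")$PVAL)
```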
Let us check how well the true correlation matrix is approximated by the fitted one, by computing the difference between them. As expected, all differences are very small.
faa2.cor.approx <- faa2$loadings %*% t(faa2$loadings) + diag(faa2$uniquenesses)
round(l1sale.cor - faa2.cor.approx, 3)
## Fresh Milk Grocery Frozen Detergents Delicassen
## Fresh 0.000 -0.011 -0.001 0.021 0.014 -0.010
## Milk -0.011 0.000 -0.001 -0.014 -0.003 0.027
## Grocery -0.001 -0.001 0.000 0.008 0.002 -0.007
## Frozen 0.021 -0.014 0.008 0.000 -0.001 -0.012
## Detergents 0.014 -0.003 0.002 -0.001 0.000 -0.010
## Delicassen -0.010 0.027 -0.007 -0.012 -0.010 0.000
Since the factors are determined only up to rotations, we have chosen the “varimax” rotation. It sparsifies the contributions of the individual products and thus makes interpretation easier.
The two factors explain 57% of the data variance.
Factor1: mainly the sum of revenues for Milk, Grocery and Detergents
Factor2: mainly the sum of Fresh, Frozen and Delicassen
The exact contribution of each product to each factor is shown here:
ggplot(as.data.frame(unclass(faa2$loadings)),
aes(x=Factor1, y=Factor2) ) + geom_text(label=products) +
theme_bw(base_size = 16) + ggtitle("Contribution of products to the two latent factors")
And here are the two factors (found by “factanal”) for each customer
Large uniquenesses show, however, that there is much more in this data than can be explained by two factors.
faa2$uniquenesses
## Fresh Milk Grocery Frozen Detergents Delicassen
## 0.654 0.296 0.133 0.613 0.263 0.646
For each of Fresh, Frozen and Delicassen, more than 60% of the revenue variability is not explained by the common factors.
Cluster analysis answers the question: which customers belong together and can be combined into a group, such that this group is noticeably different from the other groups (formed on the same principle)? Here I choose to work only with the revenue data, in order to be able to check how the clusters found relate to the known distribution channels and regions.
Cluster analysis is very sensitive to the choice of distance between customers. And defining what a “meaningful distance between customers” is, is crucial for inferring business value.
A business user should ask himself:

- how different is different? Or, in mathematical terms, is my distance euclidean, manhattan or something else?

A mathematically aware business user should also ask himself:

- with 440 customers and 6 products I am operating with 440 points in 6D space. Does this make sense, or am I (my data, actually) under the curse of dimensionality? Should I perform dimensionality reduction?
Being as good a business user as I am, I choose to care about differences relative to the mean (i.e., to scale the data), reduce the dimensionality to 4, and use the manhattan distance.
scaled.pca4 = princomp(scale(l1sale[,products]))$scores[,1:4]
di = dist(scaled.pca4, method="manhattan") # manhattan distance
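A sketch justifying the choice of four components: the cumulative share of variance they retain (reusing `l1sale` and `products` from above):

```r
# Cumulative proportion of variance explained by the principal components
pc = princomp(scale(l1sale[,products]))
round(cumsum(pc$sdev^2) / sum(pc$sdev^2), 3)
```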
For agglomerative hierarchical clustering I choose Ward's method. It minimizes the total within-cluster variance and thus leads to the most compact clusters possible.
ag = agnes(di, diss=TRUE, method="ward")
plot(ag, which=2, labels=FALSE, xlab="", sub="",
main="Dendrogram: 4-comp PCA, manhattan distance, Ward agglomeration") # dendrogram
The dendrogram suggests that 2 or 3 clusters are meaningful. Let us investigate 3 clusters.
l1sale$ag3 <- factor(cutree(ag, k = 3))
table(l1sale$Channel, l1sale$ag3, dnn=c("Channel", "ag3") )
## ag3
## Channel 1 2 3
## Cafe 22 50 226
## Retail 56 81 5
The third cluster consists mainly of Cafe customers, while the first and the second include both Cafe and Retail. What about the revenues for each product?
This set of boxplots shows that group 3 (nearly all of the Cafe customers) buys less Milk, Grocery and Detergents than the other two groups. Among the groups with mixed Cafe and Retail customers, group 1 (30% Cafe / 70% Retail, low-revenue customers) buys less Fresh, Frozen and Delicassen than group 2 (40% Cafe / 60% Retail, high-revenue customers). Here is how these groups look when I choose the axes as sums of revenues.
Thus, the 3 groups found are:

- a 30/70 Cafe/Retail mix with low-to-middle total revenues, due to low Fresh and Frozen;
- a 40/60 Cafe/Retail mix with high total revenues;
- Cafe customers with low total revenues, due to low Milk, Grocery and Detergents.
Let us depict the customers versus easily interpretable axes: sums of the revenues of specific products. The 3 “groups” found do not look compact at all.
Can this be related to an “unfortunate” choice of axes? Let us try the “most fortunate” axes possible: the first two PCA components.
clusplot(l1sale[,products], l1sale$ag3, color=TRUE, shade=FALSE, labels=3, lines=0,
col.p=group_color[as.integer(l1sale$ag3)],
col.clus=group_color[c(1,3,2)],
main="3 groups found by hierarchical clustering")
No success either. How shall we interpret this?
A good tool to assess the quality of clustering is the silhouette diagram. It shows which objects lie well within their cluster, and which ones are merely somewhere in between clusters.
si.ag3 = silhouette(as.numeric(l1sale$ag3), di);
plot.silhouette(si.ag3, col=group_color,
main="Silhouette plot for 3 groups from agglomerative clustering")
With more than a few negative values, the silhouette diagram does not look good. With average silhouette widths of 0.10, 0.29 and 0.25, the clustering should be assessed as “no substantial structure has been found”.
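One way to double-check this verdict is to compare the average silhouette width across several candidate numbers of clusters (a sketch with PAM on the same distance matrix `di`; values near or below 0.25 across the board would confirm the absence of substantial structure):

```r
# Average silhouette width of PAM for k = 2..6 clusters
sapply(2:6, function(k) pam(di, k=k, diss=TRUE)$silinfo$avg.width)
```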
Let us look at the 3 clusters from the point of view of the K-medoids algorithm. It can be considered a more robust version of k-means, one which searches for k representative objects (“medoids”) among the data points. These “medoid customers” should represent the “typical customer” of each group.
pam3 = pam(di, k=3, diss=TRUE, cluster.only=FALSE)
l1sale$pam3 = as.factor(pam3$cluster)
table(l1sale$Channel, l1sale$pam3, dnn=c("Channel", "pam3")) # compare clusters distribution Channels
## pam3
## Channel 1 2 3
## Cafe 20 162 116
## Retail 115 2 25
Groups 2 and 3 consist mainly of Cafe customers, while group 1 consists mainly of Retail customers. The separation of the distribution channels here is better than the one achieved by agglomerative hierarchical clustering. What is the difference in revenues?
Thus, the 3 groups found are:

- mostly Retail customers with high Milk, Grocery and Detergents revenues;
- mostly Cafe customers with low Milk, Grocery and Detergents revenues;
- mostly Cafe customers with medium Milk, Grocery and Detergents revenues.

The differences in Fresh, Frozen and Delicassen between the groups are relatively small.
Here are the log-revenues of the “typical customers” found
l1sale[pam3$medoids, c("pam3", products, "Total")]
## pam3 Fresh Milk Grocery Frozen Detergents Delicassen Total
## 417 1 3.64 4.04 4.04 2.93 3.83 3.00 4.54
## 116 2 4.05 2.90 3.48 3.43 2.44 2.79 4.27
## 315 3 4.03 3.25 3.88 3.17 2.93 3.09 4.37
With sums of the revenues of specific products on the axes, here are the 3 customer “groups”.
This clustering is quite different from the one found by the agglomerative method.
Did partitioning around medoids indeed find distinct clusters in this dataset? The silhouette plot (low values of the silhouette width) again indicates no substantial structure in the data.
si.pam3 = silhouette(as.numeric(l1sale$pam3), di)
plot.silhouette(si.pam3, col=group_color, main="Silhouette plot for 3 groups found by PAM")
Having done the cluster analysis, several questions remain:

- Do these results actually mean that our customers cannot be structured?
- Is it meaningful for this data to assume that there are several distinct groups, and to enforce the search for them?
If one still wants to search for distinct groups, it may be reasonable to reconsider what the “distance between customers” means. As the histograms and boxplots showed, the main difference lies in 3 products out of 6, while the distance considered so far tried to take all 6 into account. The ways out of this situation (and out of the curse of dimensionality) I will discuss in the more extensive “Wholesale Customers: cluster analysis” write-up. Ideally one would not guess the appropriate distance matrix, but let the algorithm determine it.
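As a quick illustration of such a reconsideration (not the full analysis; `di3` and `pam2b` are just illustrative names), here is a two-cluster PAM on a distance built only from the three products where the channels clearly differ:

```r
# Distance from Milk, Grocery, Detergents only, then 2-medoid clustering
di3 = dist(scale(l1sale[,c("Milk","Grocery","Detergents")]), method="manhattan")
pam2b = pam(di3, k=2, diss=TRUE)
table(l1sale$Channel, pam2b$cluster, dnn=c("Channel", "cluster"))
```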
Here I discuss another question: why are we actually trying so hard to “put a single label” on each customer?
Similar customers are in one segment. Neighboring segments are similar to each other in some products.
The upper circles in this visualization are for Milk, Grocery and Detergents. Large revenues in those, as we know from the previous analysis, are typical for Retail customers.
Some segments are quite interesting. Here we display the actual revenues instead of log-revenues.
| ID | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents | Delicassen | Total |
|---|---|---|---|---|---|---|---|---|---|
| 67 | Cafe | Other | 9 | 1534 | 7417 | 175 | 3468 | 27 | 12630 |
| 129 | Cafe | Other | 140 | 8847 | 3823 | 142 | 1062 | 3 | 14017 |
| 185 | Cafe | Other | 327 | 918 | 4710 | 74 | 334 | 11 | 6374 |
| 204 | Cafe | Lisbon | 583 | 685 | 2216 | 469 | 954 | 18 | 4925 |
| 234 | Cafe | Lisbon | 964 | 4984 | 3316 | 937 | 409 | 7 | 10617 |
| 413 | Cafe | Other | 97 | 3605 | 12400 | 98 | 2970 | 62 | 19232 |
| 440 | Cafe | Other | 2787 | 1698 | 2510 | 65 | 477 | 52 | 7589 |
| ID | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents | Delicassen | Total |
|---|---|---|---|---|---|---|---|---|---|
| 76 | Cafe | Other | 20398 | 1137 | 3 | 4407 | 3 | 975 | 26923 |
| 123 | Cafe | Other | 12212 | 201 | 245 | 1991 | 25 | 860 | 15534 |
| 162 | Cafe | Other | 12434 | 540 | 283 | 1092 | 3 | 2233 | 16585 |
| 357 | Cafe | Other | 22686 | 134 | 218 | 3157 | 9 | 548 | 26752 |
| ID | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents | Delicassen | Total |
|---|---|---|---|---|---|---|---|---|---|
| 17 | Retail | Other | 1020 | 8816 | 12121 | 134 | 4508 | 1080 | 27679 |
| 39 | Retail | Other | 4591 | 15729 | 16709 | 33 | 6956 | 433 | 44451 |
| 58 | Retail | Other | 5417 | 9933 | 10487 | 38 | 7572 | 1282 | 34729 |
| 60 | Cafe | Other | 6137 | 5360 | 8040 | 129 | 3084 | 1603 | 24353 |
| 61 | Retail | Other | 8590 | 3045 | 7854 | 96 | 4095 | 225 | 23905 |
| 95 | Retail | Other | 5626 | 12220 | 11323 | 206 | 5038 | 244 | 34657 |
| 107 | Retail | Other | 1454 | 6337 | 10704 | 133 | 6830 | 1831 | 27289 |
| 146 | Retail | Other | 22039 | 8384 | 34792 | 42 | 12591 | 4430 | 82278 |
| 156 | Retail | Other | 1989 | 10690 | 19460 | 233 | 11577 | 2153 | 46102 |
| 176 | Retail | Other | 2343 | 7845 | 11874 | 52 | 4196 | 1697 | 28007 |
| 217 | Retail | Lisbon | 2532 | 16599 | 36486 | 179 | 13308 | 674 | 69778 |
| 222 | Cafe | Lisbon | 5396 | 7503 | 10646 | 91 | 4167 | 239 | 28042 |
| 246 | Retail | Lisbon | 3062 | 6154 | 13916 | 230 | 8933 | 2784 | 35079 |
| 265 | Retail | Lisbon | 1073 | 9679 | 15445 | 61 | 5980 | 1265 | 33503 |
| 334 | Retail | Oporto | 8565 | 4980 | 67298 | 131 | 38102 | 1215 | 120291 |
| 419 | Retail | Other | 660 | 8494 | 18622 | 133 | 6740 | 776 | 35425 |
| 421 | Cafe | Other | 4456 | 5266 | 13227 | 25 | 6818 | 1393 | 31185 |
| ID | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents | Delicassen | Total |
|---|---|---|---|---|---|---|---|---|---|
| 24 | Retail | Other | 26373 | 36423 | 22019 | 5154 | 4337 | 16523 | 110829 |
| 88 | Cafe | Other | 43265 | 5025 | 8117 | 6312 | 1579 | 14351 | 78649 |
| 182 | Cafe | Other | 112151 | 29627 | 18148 | 16745 | 4948 | 8550 | 190169 |
| 184 | Cafe | Other | 36847 | 43950 | 20170 | 36534 | 239 | 47943 | 185683 |
| 203 | Cafe | Lisbon | 25203 | 11487 | 9490 | 5065 | 284 | 6854 | 58383 |
| 266 | Cafe | Lisbon | 5909 | 23527 | 13699 | 10155 | 830 | 3636 | 57756 |
| 326 | Cafe | Oporto | 32717 | 16784 | 13626 | 60869 | 1272 | 5609 | 130877 |
| 385 | Cafe | Other | 10683 | 21858 | 15400 | 3635 | 282 | 5120 | 56978 |
| 436 | Cafe | Other | 29703 | 12051 | 16027 | 13135 | 182 | 2204 | 73302 |
Do you want to know more? How similar, actually, are the customers within a given segment? How similar is a segment to its neighbors? I can tell you about this in person. I also found 3 more segments that raise questions and opportunities.
1. https://archive.ics.uci.edu/ml/datasets/Wholesale+customers — the dataset originates from a larger database referred to in: Abreu, N. (2011). Análise do perfil do cliente Recheio e desenvolvimento de um sistema promocional. Mestrado em Marketing, ISCTE-IUL, Lisbon. ↩